1 Introduction


Objectives: The goal of this kernel is to find the best approach for identifying wine quality. We will go through basic EDA and visually inspect quality via a 3D interactive plot. I also apply multiple ML models to make predictions, each with its own strengths.

rpart can give you a nice decision tree plot, so you can see the important variables more intuitively.

Random Forest can usually be run directly with a minimal amount of tuning.

xgboost is expected to produce the best result but needs a bit of tuning.

svm is an alternative approach and usually gives results that are less correlated with the tree-based models.

h2o's deep learning is one of the easiest tools for applying a deep learning model. I could potentially use keras, but given the size and structure of the data, I don't believe a deep learning model would outperform xgboost in this case.

A confusion matrix is used to evaluate each model's results.

If you have any questions, please leave a comment, and if you like the kernel, please give me an upvote~ Thanks!


2 Basic Set up



2.1 Load Packages


if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, skimr, knitr, GGally, plotly, viridis, caret, randomForest, e1071, rpart, xgboost, h2o, corrplot, rpart.plot)
#pacman::p_load(tidyverse, skimr, GGally, corrplot, plotly, viridis, caret, randomForest, e1071, rpart, rattle, xgboost, h2o)

2.2 Load Dataset


wine <- read_csv("winequality-red.csv")

3 EDA



3.1 First Glimpse via skim


skim gives you an overview of the dataset: the number of observations and columns, the range of each variable, the number of missing/unique values, a histogram, etc.

wine %>% skim() %>% kable()
## Skim summary statistics  
##  n obs: 1599    
##  n variables: 12    
## 
## Variable type: integer
## 
## variable   missing   complete   n      mean   sd     p0   p25   p50   p75   p100   hist     
## ---------  --------  ---------  -----  -----  -----  ---  ----  ----  ----  -----  ---------
## quality    0         1599       1599   5.64   0.81   3    5     6     6     8      ▁▁▁▇▇▁▂▁ 
## 
## Variable type: numeric
## 
## variable               missing   complete   n      mean    sd       p0      p25    p50     p75    p100   hist     
## ---------------------  --------  ---------  -----  ------  -------  ------  -----  ------  -----  -----  ---------
## alcohol                0         1599       1599   10.42   1.07     8.4     9.5    10.2    11.1   14.9   ▂▇▅▃▂▁▁▁ 
## chlorides              0         1599       1599   0.087   0.047    0.012   0.07   0.079   0.09   0.61   ▇▃▁▁▁▁▁▁ 
## citric acid            0         1599       1599   0.27    0.19     0       0.09   0.26    0.42   1      ▇▅▅▆▂▁▁▁ 
## density                0         1599       1599   1       0.0019   0.99    1      1       1      1      ▁▁▃▇▇▂▁▁ 
## fixed acidity          0         1599       1599   8.32    1.74     4.6     7.1    7.9     9.2    15.9   ▁▇▇▅▂▁▁▁ 
## free sulfur dioxide    0         1599       1599   15.87   10.46    1       7      14      21     72     ▇▇▅▂▁▁▁▁ 
## pH                     0         1599       1599   3.31    0.15     2.74    3.21   3.31    3.4    4.01   ▁▁▅▇▅▁▁▁ 
## residual sugar         0         1599       1599   2.54    1.41     0.9     1.9    2.2     2.6    15.5   ▇▂▁▁▁▁▁▁ 
## sulphates              0         1599       1599   0.66    0.17     0.33    0.55   0.62    0.73   2      ▂▇▂▁▁▁▁▁ 
## total sulfur dioxide   0         1599       1599   46.47   32.9     6       22     38      62     289    ▇▅▂▁▁▁▁▁ 
## volatile acidity       0         1599       1599   0.53    0.18     0.12    0.39   0.52    0.64   1.58   ▂▇▇▃▁▁▁▁

3.2 Second Glimpse via Corrplot


corrplot gives you an overview of the correlations between all the variables. It is helpful to understand these relationships at the very beginning of the analysis.

wine %>% cor() %>% corrplot.mixed(upper = "ellipse", tl.cex=.8, tl.pos = 'lt', number.cex = .8)


4 Preprocess


Correct the column names and turn the quality variable into a factor.

colnames(wine) <- wine %>% colnames() %>% str_replace_all(" ","_")
wine$quality <- as.factor(wine$quality)

5 GGally - ggpairs


I had a quick look and found that the following variables: residual_sugar, free_sulfur_dioxide, total_sulfur_dioxide, and chlorides do not differ significantly across quality levels. Therefore, these variables are not included in the ggpairs plot. Furthermore, based on the graph below, volatile_acidity, sulphates, and alcohol show more significant differences across quality levels.

The rest of the variables show some difference; however, the differences across quality levels are not significant enough to determine quality directly from any single variable. In the following EDA, I leverage the power of plotly to produce a 3D interactive graph to visualize quality at different alcohol, volatile acidity, and sulphates levels. I expect we will see a trend to some extent but will not be able to identify quality directly from the graph.

wine %>% 
  mutate(quality = as.factor(quality)) %>% 
  select(-c(residual_sugar, free_sulfur_dioxide, total_sulfur_dioxide, chlorides)) %>% 
  ggpairs(aes(color = quality, alpha=0.4),
          columns=1:7,
          lower=list(continuous="points"),
          upper=list(continuous="blank"),
          axisLabels="none", switch="both")


6 Plotly 3D Interactive Graph


The graph is consistent with my expectation. The combination of the three variables helps to some extent, but we are still not able to identify the quality clearly. This calls for machine learning.

I am going to conduct the machine learning part with different models and a held-out validation set to check which model produces the best result.

wine %>% 
  plot_ly(x=~alcohol,y=~volatile_acidity,z= ~sulphates, color=~quality, hoverinfo = 'text', colors = viridis(3),
          text = ~paste('Quality:', quality,
                        '<br>Alcohol:', alcohol,
                        '<br>Volatile Acidity:', volatile_acidity,
                        '<br>sulphates:', sulphates)) %>% 
  add_markers(opacity = 0.8) %>%
  layout(title = "3D Wine Quality",
         annotations=list(yref='paper',xref="paper",y=1.05,x=1.1, text="quality",showarrow=F),
         scene = list(xaxis = list(title = 'Alcohol'),
                      yaxis = list(title = 'Volatile Acidity'),
                      zaxis = list(title = 'sulphates')))

7 Validation Setup


set.seed(1)
inTrain <- createDataPartition(wine$quality, p=.9, list = F)

train <- wine[inTrain,]
valid <- wine[-inTrain,]
rm(inTrain)
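The split above holds out 10% of the data as a one-shot validation set. If genuine k-fold cross validation is wanted, caret's trainControl can resample the training data instead; here is a minimal sketch, where the 5-fold setting, the "rf" method, and the three-variable formula are illustrative choices, not tuned ones:

```r
# Illustrative 5-fold cross validation on the training set via caret.
# The fold count, model, and formula here are assumptions for demonstration.
ctrl <- trainControl(method = "cv", number = 5)
cv_rf <- train(quality ~ alcohol + volatile_acidity + sulphates,
               data = train,
               method = "rf",
               trControl = ctrl)
cv_rf$results  # accuracy and kappa per mtry value, averaged over the 5 folds
```

This keeps the valid set untouched for the final comparison, while the resampled estimates guide model choice on train alone.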

8 Decision Tree via rpart


The rpart plot shows that alcohol, volatile acidity, and sulphates are important variables in determining quality, which is consistent with the exploratory data analysis.

# rpart
set.seed(1)
rpart_model <- rpart(quality~alcohol+volatile_acidity+citric_acid+
                   density+pH+sulphates, train)

rpart.plot(rpart_model)

#fancyRpartPlot(rpart_model)

rpart_result <- predict(rpart_model, newdata = valid[,!colnames(valid) %in% c("quality")],type='class')

confusionMatrix(valid$quality,rpart_result)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  3  4  5  6  7  8
##          3  0  0  1  0  0  0
##          4  0  0  2  3  0  0
##          5  0  0 50 18  0  0
##          6  0  0 21 38  4  0
##          7  0  0  1 13  5  0
##          8  0  0  0  1  0  0
## 
## Overall Statistics
##                                         
##                Accuracy : 0.5924        
##                  95% CI : (0.5112, 0.67)
##     No Information Rate : 0.4777        
##     P-Value [Acc > NIR] : 0.00257       
##                                         
##                   Kappa : 0.3201        
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity                NA       NA   0.6667   0.5205  0.55556       NA
## Specificity          0.993631  0.96815   0.7805   0.7024  0.90541 0.993631
## Pos Pred Value             NA       NA   0.7353   0.6032  0.26316       NA
## Neg Pred Value             NA       NA   0.7191   0.6277  0.97101       NA
## Prevalence           0.000000  0.00000   0.4777   0.4650  0.05732 0.000000
## Detection Rate       0.000000  0.00000   0.3185   0.2420  0.03185 0.000000
## Detection Prevalence 0.006369  0.03185   0.4331   0.4013  0.12102 0.006369
## Balanced Accuracy          NA       NA   0.7236   0.6115  0.73048       NA
varImp(rpart_model) %>% kable()
                       Overall
-----------------  -----------
alcohol             117.939988
citric_acid          42.390967
density              45.500825
pH                    4.487869
sulphates            85.251621
volatile_acidity     81.500436
rm(rpart_model, rpart_result)

9 Random Forest


Even without any tuning, random forest produces a much better result than the decision tree model.

# randomforest
set.seed(1)
rf_model <- randomForest(quality~alcohol+volatile_acidity+citric_acid+
                           density+pH+sulphates,train)
rf_result <- predict(rf_model, newdata = valid[,!colnames(valid) %in% c("quality")])

confusionMatrix(valid$quality,rf_result)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  3  4  5  6  7  8
##          3  0  0  0  1  0  0
##          4  0  0  4  1  0  0
##          5  0  0 54 14  0  0
##          6  0  0 17 40  6  0
##          7  0  0  1  6 11  1
##          8  0  0  0  0  0  1
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6752          
##                  95% CI : (0.5959, 0.7476)
##     No Information Rate : 0.4841          
##     P-Value [Acc > NIR] : 1.028e-06       
##                                           
##                   Kappa : 0.475           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity                NA       NA   0.7105   0.6452  0.64706 0.500000
## Specificity          0.993631  0.96815   0.8272   0.7579  0.94286 1.000000
## Pos Pred Value             NA       NA   0.7941   0.6349  0.57895 1.000000
## Neg Pred Value             NA       NA   0.7528   0.7660  0.95652 0.993590
## Prevalence           0.000000  0.00000   0.4841   0.3949  0.10828 0.012739
## Detection Rate       0.000000  0.00000   0.3439   0.2548  0.07006 0.006369
## Detection Prevalence 0.006369  0.03185   0.4331   0.4013  0.12102 0.006369
## Balanced Accuracy          NA       NA   0.7688   0.7015  0.79496 0.750000

After reviewing the result, let’s look at which variable contributes the most.

varImp(rf_model) %>% kable()
                      Overall
-----------------  ----------
alcohol              194.0940
volatile_acidity     158.6520
citric_acid          127.2520
density              158.2908
pH                   133.5067
sulphates            154.0123
varImpPlot(rf_model)

rm(rf_model, rf_result)

10 SVM


SVM provides an alternative solution; however, the result is not outstanding.

# svm
set.seed(1)
svm_model <- svm(quality~alcohol+volatile_acidity+citric_acid+
                           density+pH+sulphates,train)
svm_result <- predict(svm_model, newdata = valid[,!colnames(valid) %in% c("quality")])

confusionMatrix(valid$quality,svm_result)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  3  4  5  6  7  8
##          3  0  0  1  0  0  0
##          4  0  0  3  2  0  0
##          5  0  0 57 11  0  0
##          6  0  0 20 41  2  0
##          7  0  0  1 12  6  0
##          8  0  0  0  1  0  0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6624          
##                  95% CI : (0.5827, 0.7359)
##     No Information Rate : 0.5223          
##     P-Value [Acc > NIR] : 0.0002612       
##                                           
##                   Kappa : 0.4339          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity                NA       NA   0.6951   0.6119  0.75000       NA
## Specificity          0.993631  0.96815   0.8533   0.7556  0.91275 0.993631
## Pos Pred Value             NA       NA   0.8382   0.6508  0.31579       NA
## Neg Pred Value             NA       NA   0.7191   0.7234  0.98551       NA
## Prevalence           0.000000  0.00000   0.5223   0.4268  0.05096 0.000000
## Detection Rate       0.000000  0.00000   0.3631   0.2611  0.03822 0.000000
## Detection Prevalence 0.006369  0.03185   0.4331   0.4013  0.12102 0.006369
## Balanced Accuracy          NA       NA   0.7742   0.6837  0.83138       NA
rm(svm_model, svm_result)

11 xgboost


xgboost with a little bit of hyper-parameter tuning achieved the best result among the models so far.

# xgboost
data.train <- xgb.DMatrix(data = data.matrix(train[, !colnames(train) %in% c("quality")]), label = as.integer(train$quality)) # factor levels 3-8 become integer labels 1-6
data.valid <- xgb.DMatrix(data = data.matrix(valid[, !colnames(valid) %in% c("quality")]))


parameters <- list(
  # General Parameters
  booster            = "gbtree",      
  silent             = 0,           
  # Booster Parameters
  eta                = 0.2,              
  gamma              = 0,                 
  max_depth          = 5,                
  min_child_weight   = 2,            
  subsample          = 1,                 
  colsample_bytree   = 1,                
  colsample_bylevel  = 1,          
  lambda             = 1,    
  alpha              = 0,       
  # Task Parameters
  objective          = "multi:softmax",   # default = "reg:linear"
  eval_metric        = "merror",
  num_class          = 7,
  seed               = 1               # reproducibility seed
)

xgb_model <- xgb.train(parameters, data.train, nrounds = 100)

xgb_pred <- predict(xgb_model, data.valid)

confusionMatrix(as.factor(xgb_pred+2), valid$quality) # +2 maps labels 1-6 back to the 3-8 quality scale
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  3  4  5  6  7  8
##          3  0  0  0  0  0  0
##          4  0  0  1  1  0  0
##          5  0  4 55 12  2  0
##          6  1  1 12 45  5  0
##          7  0  0  0  4 11  0
##          8  0  0  0  1  1  1
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7134          
##                  95% CI : (0.6359, 0.7826)
##     No Information Rate : 0.4331          
##     P-Value [Acc > NIR] : 1.144e-12       
##                                           
##                   Kappa : 0.5399          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity          0.000000  0.00000   0.8088   0.7143  0.57895 1.000000
## Specificity          1.000000  0.98684   0.7978   0.7979  0.97101 0.987179
## Pos Pred Value            NaN  0.00000   0.7534   0.7031  0.73333 0.333333
## Neg Pred Value       0.993631  0.96774   0.8452   0.8065  0.94366 1.000000
## Prevalence           0.006369  0.03185   0.4331   0.4013  0.12102 0.006369
## Detection Rate       0.000000  0.00000   0.3503   0.2866  0.07006 0.006369
## Detection Prevalence 0.000000  0.01274   0.4650   0.4076  0.09554 0.019108
## Balanced Accuracy    0.500000  0.49342   0.8033   0.7561  0.77498 0.993590
rm(xgb_model, xgb_pred, data.train, data.valid, parameters)
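The parameters above were tuned by hand. One way to pick nrounds and sanity-check a parameter set without touching the validation split is xgb.cv; this is an illustrative sketch, not an exhaustive search, and it rebuilds the DMatrix since the earlier objects were removed above:

```r
# Illustrative: choose nrounds via 5-fold CV inside the training data only.
# The parameter values mirror the hand-tuned list above; fold count and
# early-stopping patience are assumptions for demonstration.
dtrain <- xgb.DMatrix(data = data.matrix(train[, !colnames(train) %in% "quality"]),
                      label = as.integer(train$quality))
params <- list(booster = "gbtree", eta = 0.2, max_depth = 5,
               min_child_weight = 2,
               objective = "multi:softmax", eval_metric = "merror",
               num_class = 7)
cv <- xgb.cv(params = params, data = dtrain, nrounds = 200, nfold = 5,
             early_stopping_rounds = 10,  # stop once merror stops improving
             verbose = 0)
cv$best_iteration  # candidate value for nrounds in the final xgb.train call
```

A grid over eta, max_depth, and min_child_weight around these values would be the natural next step, with the validation set reserved for the final comparison only.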

12 h2o (deeplearning)


h2o is one of the easiest tools for applying a deep learning model; however, due to the size of the dataset, deep learning does not outperform the xgboost model.

# h2o
h2o.init()
h2o.train <- as.h2o(train)
h2o.valid <- as.h2o(valid)


h2o.model <- h2o.deeplearning(x = setdiff(names(train), c("quality")),
                              y = "quality",
                              training_frame = h2o.train,
                              # activation = "RectifierWithDropout", # algorithm
                              # input_dropout_ratio = 0.2, # % of inputs dropout
                              # balance_classes = T,
                              # momentum_stable = 0.99,
                              # nesterov_accelerated_gradient = T, # use it for speed
                              epochs = 1000,
                              standardize = TRUE,         # standardize data
                              hidden = c(100, 100),       # 2 layers of 100 nodes each
                              rate = 0.05,                # learning rate
                              seed = 1                # reproducibility seed
)

h2o.predictions <- h2o.predict(h2o.model, h2o.valid) %>% as.data.frame()
confusionMatrix(h2o.predictions$predict, valid$quality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  3  4  5  6  7  8
##          3  0  0  1  0  0  0
##          4  0  0  2  1  0  0
##          5  0  4 51 14  3  0
##          6  1  1 12 43  3  0
##          7  0  0  2  5 12  0
##          8  0  0  0  0  1  1
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6815          
##                  95% CI : (0.6025, 0.7535)
##     No Information Rate : 0.4331          
##     P-Value [Acc > NIR] : 2.987e-10       
##                                           
##                   Kappa : 0.4966          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity          0.000000  0.00000   0.7500   0.6825  0.63158 1.000000
## Specificity          0.993590  0.98026   0.7640   0.8191  0.94928 0.993590
## Pos Pred Value       0.000000  0.00000   0.7083   0.7167  0.63158 0.500000
## Neg Pred Value       0.993590  0.96753   0.8000   0.7938  0.94928 1.000000
## Prevalence           0.006369  0.03185   0.4331   0.4013  0.12102 0.006369
## Detection Rate       0.000000  0.00000   0.3248   0.2739  0.07643 0.006369
## Detection Prevalence 0.006369  0.01911   0.4586   0.3822  0.12102 0.012739
## Balanced Accuracy    0.496795  0.49013   0.7570   0.7508  0.79043 0.996795
rm(h2o.model, h2o.train, h2o.valid, h2o.predictions)

13 Conclusion


As I expected, xgboost gives the best outcome among all the models in this kernel. A better hyper-parameter-tuned xgboost model or lightgbm could potentially produce a better result. Additionally, an ensemble model might also improve the result.
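One simple ensemble would combine the per-model class predictions by majority vote. A minimal sketch follows; the object names rf_result, svm_result, and xgb_result are hypothetical here, since the kernel removed each model's predictions after evaluation (the rm() calls would need to be skipped, and the xgboost predictions converted to the 3-8 factor scale):

```r
# Hypothetical majority-vote ensemble over three sets of class predictions.
# Assumes rf_result, svm_result, xgb_result are factors of predicted quality
# for the same validation rows; ties break toward the first-seen class.
majority_vote <- function(...) {
  votes <- cbind(...)  # one column of predicted labels per model
  apply(votes, 1, function(row) names(which.max(table(row))))
}
ensemble_pred <- factor(majority_vote(as.character(rf_result),
                                      as.character(svm_result),
                                      as.character(xgb_result)),
                        levels = levels(valid$quality))
confusionMatrix(ensemble_pred, valid$quality)
```

Because the svm predictions are less correlated with the tree-based models, a vote like this can sometimes recover cases where a single model errs, though with this dataset the gain is not guaranteed.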

If you have any questions, please leave a comment, and if you like the kernel, please give it an upvote~ Thanks!